-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use pyarrow dtype_backend #1781
Conversation
I'd like to see some extra tests with adding/setting a default pandas dataframe, without the correct dtypte and time. edit: Basically trying to break it ;) |
Tested it on De Dommel workflow and get a model that gets written with the Python-API and read and simulated without ValidationErrors or Exceptions. Didn't test/check any results, but the model + results are here: https://we.tl/t-Uxk9xra8i1 However, I do repeatedly get this FutureWarning which is messing up my logging quite alot:
|
python/ribasim/ribasim/input_base.py
Outdated
df = gpd.read_file(path, layer=table, fid_as_index=True) | ||
df = pyogrio.read_dataframe( | ||
path, | ||
layer=table, | ||
fid_as_index=True, | ||
use_arrow=True, | ||
arrow_to_pandas_kwargs={"types_mapper": pd.ArrowDtype}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only pyogrio supported this directly, geopandas not yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best to place an inline comment to the specific kwarg that's not supported yet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I misunderstood that all extra kwargs were passed from gpd.read_file
to pyogrio.read_dataframe
if engine="pyogrio"
, so I went back to gpd.read_file
and added a comment on arrow_to_pandas_kwargs
, which ends up in pyarrow
, not pyogrio
.
time=pd.date_range(start="2020-01-01", end="2021-01-01", periods=100), | ||
time=pd.date_range( | ||
start="2020-01-01", end="2021-01-01", periods=100, unit="ms" | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needed unit="ms"
otherwise the nanosecond precision times get a ValidationError now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we automatically convert such timestamps? I fear that users might hit this and have to find this fix manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only point where we got nanosecond precision was with the data_range
with periods pattern. I feel like this pattern will rarely be used for real models, maybe only synthetic ones like our docs. I don't think I've ever come across hydrological data with nanosecond precision.
Nanosecond units will automatically convert to millisecond as long as it doesn't lose data. The ValidationError is also quite good. I prefer not to throw a warning on this as it usually doesn't matter. If you know an easy way to round or truncate to the nearest millisecond I'd be ok with that as well. But I doubt many users will run into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, but I have my doubts on how we handle the defaults of pandas (both for time as the arrow dtype).
In terms of dtype, I think it's automatically converted, so we should be fine there. However, in terms of time I'd rather we convert (maybe with a warning), rather than erroring out. That still seems an improvement over silently dropping precision. That said it's mostly a usability concern for me here.
python/ribasim/ribasim/schemas.py
Outdated
node_id: Series[Int32] = pa.Field(nullable=False, default=0) | ||
area: Series[float] = pa.Field(nullable=False) | ||
level: Series[float] = pa.Field(nullable=False) | ||
node_id: Series[Annotated[pd.ArrowDtype, pyarrow.int32()]] = pa.Field(nullable=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have lost the defaults here, was that on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 0 was a dummy-default since np.int32
doesn't support missings, but pyarrow.int32
does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well its also because the node_id is a required field and can't be missing schema wise?
Fixes #1721
Fixes #1394
The
dtype_backend
part of this is not breaking. The only part that is technically breaking is that we specify a unit of milliseconds for the Arrow time type. Previously we used the default nanosecond precision, which was then truncated to milliseconds in the core. I think it is better to disallow precision higher than milliseconds if we cannot distinguish them in the core.